Gersten, R. M., Clarke, B., Jordan, N., Newman-Gonchar, R., Haymond, K., & Wilkins, C. (2012). Universal screening in mathematics for the primary grades: Beginnings of a research base. Exceptional Children, 78, 423-445. Retrieved from http://cec.metapress.com/content/B75U2072576416T7

Grant/Contract Number: R324A120304 

This research was supported by the ROOTS Project, Grant No. R324A120304, funded by the 
U.S. Department of Education, Institute of Education Sciences. The opinions expressed are those 
of the authors and do not represent the views of the Institute or the U.S. Department of 
Education. 



Vol. 78, No. 4, pp. 423-445. 

©2012 Council for Exceptional Children. 


Exceptional Children 


Universal Screening in 
Mathematics for the Primary 
Grades: Beginnings of 
a Research Base 


RUSSELL GERSTEN 

Instructional Research Group 

BEN CLARKE 

University of Oregon 

NANCY C. JORDAN 

University of Delaware 

REBECCA NEWMAN-GONCHAR 

KELLY HAYMOND 

Instructional Research Group 

CHUCK WILKINS 

Edvance Research, Inc. 


Abstract: This article describes key findings from contemporary research on screening for early primary grade students in the area of mathematics. Existing studies were used to illustrate the constructs most worth measuring and the diverse strategies that researchers used to study potential measures. The authors discussed the strengths and weaknesses of assessing a few key proficiencies (as is often done in early reading) versus a more full-scale battery, and described the importance of going beyond merely reporting predictive validity correlation coefficients to examining the classification accuracy, specificity, and sensitivity of screening measures.


Recent longitudinal research strongly suggests that students who perform poorly on simple mathematics problems at the end of kindergarten and first grade are likely to continue to perform poorly in mathematics through fourth grade (Duncan et al., 2007; Jordan, Kaplan, Ramineni, & Locuniak, 2009; Morgan, Farkas, & Wu, 2009). In fact, using a nationally representative sample of students, Morgan et al. (2009) found that students who remained in the lowest 10th percentile at both the beginning and end of kindergarten (often considered an indicator of a learning disability in mathematics) had a 70% chance of remaining in the lowest 10th percentile 5 years later. They also tended to score, on average, two standard deviation units (48 percentile points) below students who were in the acceptable range of mathematics performance in kindergarten. Jordan et al. (2009) found that kindergartners’ number sense, that is, their knowledge of number relationships and the meaning of number concepts, predicts later mathematics achievement even when statistically controlling for IQ and socioeconomic status.
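
The correspondence between two standard deviation units and roughly 48 percentile points can be checked with a short calculation. The sketch below is illustrative only: it assumes normally distributed scores and a comparison student sitting exactly at the mean, neither of which is stated in the studies cited, and the function name `percentile_gap` is hypothetical.

```python
from statistics import NormalDist


def percentile_gap(sd_units: float) -> float:
    """Percentile-point gap between a score at the mean and a score
    `sd_units` standard deviations below it (normality is assumed)."""
    std_normal = NormalDist(mu=0.0, sigma=1.0)
    # cdf(0) is the 50th percentile; cdf(-sd_units) is the lower score's percentile.
    return 100.0 * (std_normal.cdf(0.0) - std_normal.cdf(-sd_units))


gap = percentile_gap(2.0)
print(f"{gap:.1f} percentile points")  # prints "47.7 percentile points"
```

Under these assumptions the gap rounds to 48 percentile points, in line with the figure reported above.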

Just as the persistence of reading disabilities stimulated widespread investment in early intervention and screening in reading, we hope that the concurrent findings on the persistence of mathematics difficulties will spur similar progress in identifying measures to screen students likely to experience difficulties in mathematics. In the past decade, a number of mathematics screening measures for use in the primary grades have been developed. We can draw some reliable conclusions from the convergent findings of these efforts. Using extant studies of early screening in mathematics, we illustrate principles and draw attention to issues to help professionals understand which mathematics constructs to measure as well as the strengths and weaknesses of contemporary screening measures.

Additionally, we examine a critical, but rarely explored, issue in research on screening in education: classification accuracy, that is, the precision with which measures accurately detect which students will have trouble in mathematics without intensive intervention. Because this area of research is underdeveloped in mathematics, this article will describe the concept of classification accuracy, specifically sensitivity and specificity, and their relationship to decisions educators must make when selecting a screening measure.
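
For readers less familiar with these terms, sensitivity and specificity can be sketched in a few lines of code. The example below uses the standard definitions (sensitivity is the share of students who later struggle that the screener flagged; specificity is the share of students who do not struggle that the screener correctly passed). The class name `ScreeningOutcome` and all counts are hypothetical and are not taken from any study reviewed here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScreeningOutcome:
    """Counts from crossing a fall screening decision with a later outcome."""
    true_pos: int   # flagged by the screener, later had difficulty
    false_pos: int  # flagged by the screener, later did fine
    true_neg: int   # not flagged, later did fine
    false_neg: int  # not flagged, later had difficulty

    def sensitivity(self) -> float:
        # Proportion of students with later difficulty who were flagged.
        return self.true_pos / (self.true_pos + self.false_neg)

    def specificity(self) -> float:
        # Proportion of students without later difficulty who were not flagged.
        return self.true_neg / (self.true_neg + self.false_pos)

    def overall_accuracy(self) -> float:
        total = self.true_pos + self.false_pos + self.true_neg + self.false_neg
        return (self.true_pos + self.true_neg) / total


# Hypothetical counts for 200 screened students (illustration only).
outcome = ScreeningOutcome(true_pos=45, false_pos=30, true_neg=120, false_neg=5)
print(outcome.sensitivity(), outcome.specificity(), outcome.overall_accuracy())
```

In this hypothetical case the screener misses few students who need help (high sensitivity) at the cost of flagging a number of students who would have done fine, which is the kind of trade-off educators weigh when choosing a cut score.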

APPROACHES TAKEN 
TOWARDS MEASURING 
NUMBER PROFICIENCY 
IN YOUNG STUDENTS 

Virtually all math screening measures for the primary grades rely on assessing aspects of what is often referred to as number sense. Okamoto and Case (1996) describe number sense as the development of increasingly sophisticated understanding of numbers, an understanding that is typically represented by students’ ability to use increasingly sophisticated mental number lines. Individuals with good number sense appear to develop a mental number line on which they can represent and manipulate numerical quantities. However, number sense is more complex than the development of a mental number line. Berch (2005) captures the complexities of articulating a working definition of number sense, remarking, “Possessing number sense ostensibly permits one to achieve everything from understanding the meaning of numbers to developing strategies for solving complex math problems; from making simple magnitude comparisons to inventing procedures for conducting numerical operations” (p. 334). For that reason, the National Research Council (2009) recommended use of the term number proficiencies to refer to the specific components of number sense that are the focus of an assessment or an intervention. Because both terms have been used historically, we use both in this article.

Researchers have adopted an array of approaches for developing assessments of early number proficiency/number sense. The first approach attempts to develop efficient screening measures so that schools can discern which students are likely to require additional assistance. Many of these researchers (e.g., Lembke & Foegen, 2009) have focused on the development of a set of brief, timed measures, each of which gauges one key aspect of number competence. Typically, the research team calculates predictive validity indices for each individual test. Recently, researchers have used some form of ordinary least squares regression to develop a composite score (often the sum or average of each individual measure) that predicts subsequent mathematics achievement. The second approach for developing screening measures includes the development of one measure of number proficiency that intentionally samples across several different aspects of number proficiency. Examples are the research of Bryant, Bryant, Gersten, Scammacca, and Chavez (2008) and the recent research of Jordan, Glutting, and Ramineni (2010).
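
The composite-score idea described above can be sketched as follows. This is a minimal illustration, not the procedure of any particular research team: it assumes the composite is the average of standardized (z) scores on two brief measures and fits a one-predictor ordinary least squares line to a spring outcome. All function names and scores are hypothetical.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Standardize raw scores so measures on different scales can be combined."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


def composite(subtests):
    """Average the z-scores of several subtests, student by student."""
    standardized = [z_scores(scores) for scores in subtests]
    return [mean(student) for student in zip(*standardized)]


def ols_fit(x, y):
    """Ordinary least squares slope and intercept for one predictor."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


# Hypothetical fall scores for five students on two brief timed measures.
magnitude_comparison = [4, 9, 12, 15, 20]
missing_number = [2, 5, 7, 8, 13]
# Hypothetical spring achievement scores for the same students.
spring_outcome = [410, 440, 455, 470, 505]

fall_composite = composite([magnitude_comparison, missing_number])
slope, intercept = ols_fit(fall_composite, spring_outcome)
predicted = [intercept + slope * c for c in fall_composite]
print(slope, intercept)
```

Because the composite is built from z-scores, its mean is zero, so the fitted intercept is simply the mean spring score and the slope expresses the expected change in the outcome per standard deviation of the composite.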
Procedure for Reviewing the
Literature on Early Screening in
Mathematics

We conducted a literature review using the ERIC and PsycINFO databases. We used the descriptors screening and mathematics and limited our search to empirical studies published between 1996 and 2011. We limited our search to studies involving children ranging in age from birth to 12 years old and excluded dissertations. We also conducted a manual search of major journals in special, remedial, and elementary education (Journal of Special Education, Exceptional Children, Journal of Educational Psychology, and Journal of Learning Disabilities) to locate relevant studies.

This search resulted in the identification of 48 studies. Of this total, 21 studies were selected for further review based on analysis of the title, abstract, and keywords. Of these 21 studies, 16 (76%) met our criteria for inclusion. Out of the 16 studies identified, 11 focused on single proficiency measures, four included multiple proficiency measures, and five used diagnostic utility statistics and receiver operating characteristics (ROC) analyses to predict mathematics learning disability (MLD) or low-achieving students. (Seethaler and Fuchs, 2010, used a single proficiency measure, a multiple proficiency measure, and diagnostic utility statistics. Geary, Bailey, and Hoard, 2009, used a single proficiency measure and diagnostic utility statistics. Clarke et al., 2011, used a multiple proficiency measure and diagnostic utility statistics.)
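
Because three studies fall into more than one category, the category tallies (11, 4, and 5) sum to more than 16. The short check below reconciles the counts reported above; the variable names are ours and serve only as illustration.

```python
# Category tallies as reported in the text.
category_tallies = {"single proficiency": 11, "multiple proficiency": 4, "diagnostic utility": 5}

# Number of categories in which each multi-category study is counted.
categories_per_study = {
    "Seethaler & Fuchs (2010)": 3,
    "Geary, Bailey, & Hoard (2009)": 2,
    "Clarke et al. (2011)": 2,
}

total_tallies = sum(category_tallies.values())                        # 20
extra_tallies = sum(n - 1 for n in categories_per_study.values())     # 4
unique_studies = total_tallies - extra_tallies                        # 16

inclusion_rate = round(100 * unique_studies / 21)                     # 76
print(unique_studies, inclusion_rate)  # prints "16 76"
```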

Our criteria for inclusion limited our review to studies that targeted kindergarten and first-grade students, included screening measures and outcome variables specific to mathematics performance, and reported predictive validity, ROC curves, or sensitivity and specificity analyses. Two of the studies, Locuniak and Jordan (2008) and Bryant et al. (2008), included both a first- and a second-grade sample. Although we limit the data presented in the tables to first grade only, we discuss the second-grade sample in the text of this article. We excluded studies that used one or more norm-referenced standardized measures as a screener because we were interested in an efficient screener or screening batteries. Many of the standardized measures are much longer than we would recommend for a screener, often taking between 1 hr and 3 hr.

For the data presented in Tables 1 and 2, we focused on studies that provided correlations between screeners administered in the fall and mathematics outcomes administered in the spring of that same year (in one case, we included a screener given in the spring of the preceding year because this is a practice that some districts use). For the tables specifically, we compared measures across a similar time frame because most schools screen students in the fall as a means of predicting who is likely to perform poorly at the end of the year without receiving additional assistance. However, we did include long-range prediction studies in Table 3 regardless of when the screener was administered (fall, winter, or spring) to include studies using recent advances in predictive validity methods.

Measures of Critical Aspects of 
Number Proficiency 

Most research on screening measures in early primary grades (e.g., Lembke & Foegen, 2009; Methe, Hintze, & Floyd, 2008) has focused on discrete proficiencies, rather than deficiencies, an approach that seems more appropriate for universal screening measures. Although these measures are not designed to be comprehensive, when done well, their results may be related to students’ performance on other critical aspects of mathematics. For example, a good measure of magnitude comparison may serve as an indicator of likely performance in place value or mental calculation. Most of these measures are easy to administer and typically take from as little as 1 min to 5 min to complete. Such measures could be used to quickly identify students whose mathematics achievement is either on track or at risk in one or more critical areas related to development of number sense/number proficiency, the most critical component of the early elementary grade mathematics curriculum (National Mathematics Advisory Panel, 2008).

However, as with any screening measure, these brief measures cannot provide a full diagnostic profile. As shown in Table 1, the predictive validity correlations are typically reasonable, but not high, and often not as high as comparable measures in elementary school reading (Foegen, Jiban, & Deno, 2007). In addition, these measures are individually administered (at least at the current point in time). Although this mode may be preferable in kindergarten and the beginning of first grade, it is far more burdensome than computer-administered assessments or even pencil and paper assessments.

[TABLE 1. Predictive Validity of Screening Measures for the Primary Grades]

[Table: Diagnostic Utility Statistics and Receiver Operating Characteristics (ROC)]

[Table note: ... and 44 on Number Sense; False Negatives = 6 on QD, 5 on CF, and 7 on Number Sense; True Positives = 53 on QD, 54 on CF, and 52 on Number Sense; False Positives = 93 on QD, 88 on CF, and 93 on Number Sense.]

Four components of number sense/number competence deemed most important by cognitive psychologists include: (a) magnitude comparison (Booth & Siegler, 2006), (b) strategic counting (Geary, 2004), (c) the ability to solve simple word problems (Jordan et al., 2009), and (d) retrieval of basic arithmetic facts (Jordan, Hanich, & Kaplan, 2003). Table 1 provides descriptions of key elements of the literature base among the four components of number sense/number competence.

Magnitude Comparison. Magnitude comparison is the ability to discern which number is the greatest in a set, and to be able to weigh relative differences in magnitude efficiently (e.g., to know that 11 is a bit bigger than 9, but 18 is a lot bigger than 9). As children develop a more sophisticated understanding of number and quantity, they are able to make increasingly complex judgments about magnitude. Riley, Greeno, and Heller (1983) found that, given a hypothetical scenario with a picture of five birds and one worm, most preschoolers could answer questions such as, “Suppose the birds all race over and each one tries to get a worm. Will every bird get a worm?” Their answers demonstrate a gross magnitude judgment that there are more birds than worms. But given a specific question about magnitude, for example, “How many birds won’t get a worm?” (p. 169), most preschoolers could not answer correctly. The ability to make these finite types of magnitude comparisons is a critical underpinning of the ability to calculate, and, in the view of Okamoto and Case (1996) as well as Booth and Siegler (2006), represents the evolution of an increasingly sophisticated and accurate mental number line, as discussed above.

A number of research teams have designed and tested similar measures of magnitude comparison for kindergarten and first grade (see Table 1). All measures included a timed element, but the range of numbers used in the materials varied in response to potential concerns about floor or ceiling effects. One of the first efforts to develop a measure of magnitude comparison was by Clarke and Shinn (2004), who tested a timed magnitude comparison measure with first-grade students using fall screening to predict performance on the Woodcock-Johnson Psychoeducational Battery-Revised (WJ-R) Applied Problems subtest (Woodcock & Johnson, 1989). Predictive validity was .79, which is quite high. Clarke, Baker, Smolkowski, and Chard (2008) extended the work to a kindergarten sample, only including numbers between 1 and 10, rather than 1 and 20. Predictive validity was .62 with a standardized achievement test.

Table 1 presents additional data on predictive validity of magnitude comparison measures. Coefficients were fairly consistent, with a median of .62 for first grade and .50 for kindergarten. Many studies suffer from the limitation of using only one site, with the exception of Clarke, Gersten, Dimino, and Rolfhus (2012) and Lembke and Foegen (2009). However, the fact that kindergarten and first-grade findings were replicated across multiple studies in multiple sites does indicate great promise for a timed measure of magnitude comparison.
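
The predictive validity coefficients discussed in this section are correlations between a screener score and a later outcome. As a minimal sketch, assuming the coefficient in question is a Pearson product-moment correlation (the usual choice, though individual studies may differ), it can be computed as below. The scores are hypothetical and do not come from any of the studies cited.

```python
from math import sqrt
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation between paired screener and outcome scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long lists with at least two scores")
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical fall magnitude comparison scores (items correct in 1 min)
# and spring achievement scores for the same six students.
fall_screener = [3, 7, 8, 12, 15, 18]
spring_outcome = [420, 435, 450, 448, 470, 490]

print(round(pearson_r(fall_screener, spring_outcome), 2))
```

With so few hypothetical students the coefficient is far higher than the values reported in Table 1; the point is only to show what the reported numbers summarize.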

Strategic Counting. The ability to understand how to count efficiently and use counting strategies is fundamental to developing mathematical understanding and proficiency (Siegler & Robinson, 1982). Geary (2004) notes that weak ability in counting strategies is a key indicator of which young students are likely to have difficulty learning mathematics. In most cases, competence in counting strategies is strongly related to burgeoning knowledge of number properties. Once a child possesses the “count on” strategy, if asked “what is 9 more than 2?” she will automatically know that it is much more efficient to reverse the problem to 2 more than 9, and simply “count on” from 9. Counting on from the larger addend is important for learning addition and subtraction number combinations, and grasping the count on strategy demonstrates the beginnings of a grasp of the commutative property of addition.

A number of researchers have developed strategic counting measures that require students to identify the missing number from a sequence of numbers. All measures included a timed element, but different ranges of numbers were used. The Clarke et al. (2008) Missing Number kindergarten measure used numbers between 1 and 10, and their first-grade measure used numbers between 1 and 20. Lembke and Foegen (2009) developed a measure of strategic counting in which students were given 1 min to identify a missing number from a sequence of four consecutive numbers. They also included items (20% of the items) that assessed strategic counting by 5s and 10s (e.g., 5, 10, __, 20). The range of numbers used was up to 20 for count by 1s, up to 50 for count by 5s, and up to 100 for count by 10s. Unlike other researchers, they used the same items for kindergarten and first grade. The predictive validity was weak for kindergarten (.37), in fact the weakest in the set of studies. In contrast, it was moderately strong for first grade (.68), suggesting that kindergartners and first graders require different sets of items in screening measures.

Each research team found moderate concurrent and predictive validities (range = .37-.68) and strong reliabilities (range = .59-.98). Specifically, the predictive validities are quite high for first grade with a median coefficient of .62, but in the low to moderate range for kindergarten with a median of .475. This measure thus does not seem a suitable screener for the beginning of kindergarten.

Word Problems Involving Simple Arithmetic Operations. Jordan, Levine, and Huttenlocher (1994) found that although adults often think that children have a hard time solving word problems, young children, in fact, find them easier than even simple number sentences. In other words, before formal schooling, children can much more easily tell you how many sheep are left if you start out with 9 and lose 2 than they can tell you that 9 minus 2 is 7.

For that reason, simple word problems have been added to early screening batteries in recent years (Fuchs et al., 2007; Locuniak & Jordan, 2008). Locuniak and Jordan created a simple eight-item story problem measure with four addition and four subtraction story problems. Performance on the story problem measure in the fall of kindergarten was moderately related to performance on a measure of calculation fluency at the end of second grade (.51), quite high for predicting performance over the course of 3 school years. The word problem measure was less strongly related to performance on the digit span forward and backward tasks of the Wechsler Intelligence Scale for Children-IV (WISC-IV; Wechsler, 2003).

Retrieval of Basic Arithmetic Facts. Some of the earliest research on mathematics difficulties focused on correlates of students in the upper elementary grades who were identified as demonstrating a learning disability by school personnel. One consistent finding (Goldman, Pellegrino, & Mertz, 1988) was that students who struggled with mathematics in the elementary grades were unable to automatically retrieve addition and subtraction number combinations. Research seems to indicate that although students with learning disabilities in mathematics often make good strides in terms of facility with algorithms, procedures, and simple word problems, severe deficits remain in their retrieval of basic combinations (Geary, 2004; Jordan et al., 2003). These deficiencies suggest underlying problems with what Geary calls semantic memory (i.e., the ability to store and retrieve abstract information efficiently). This ability appears to be critical for students to succeed in mathematics and, ultimately, to understand mathematics. Jordan et al. (2003), however, argue that poor fact retrieval has its roots in weak number sense. It is difficult for children to become automatic with addition and subtraction number combinations when they do not have a good sense of relations between and among numbers and operations. The ability to solve number combinations involving addition and subtraction, even at the beginning of kindergarten, is considered a powerful predictive measure of mathematics achievement through third grade (Jordan et al., 2009).

Measures of fact retrieval appear to be promising based on one study by Bryant et al. (2008), who designed an addition and subtraction fact screener for first- and second-grade students. Students had 1 min to complete basic facts problems. Concurrent validity correlations with the Stanford Achievement Test-Tenth Edition (SAT-10; Pearson, 2003) were .55 for first-grade students and .59 for second-grade students. These data suggest that a fact retrieval measure would be a sensible addition to a screening battery in the second grade and possibly first grade. In fact, recent research (Clarke et al., 2012) examined the predictive validity of a timed fact retrieval measure with the widely used mathematics achievement test Terra Nova (CTB/McGraw-Hill, 2008) for first-grade students and found a correlation coefficient of .50.

Each of the four competencies discussed previously appears reasonable for use in early screening. Each of the measures discussed is brief and easy to administer, important characteristics of a screening measure.

MULTIPLE NUMBER 
PROFICIENCY MEASURES 

Another promising approach is the use of measures that cover multiple but related number competencies that young children need in order to be successful in mathematics. Much of the research in this area is quite new, but it appears as promising as the research on single proficiency measures. Research to date demonstrates that measures encompassing multiple aspects of number competence, such as the Number Knowledge Test (NKT; Okamoto & Case, 1996) and Number Sense Brief (Jordan, Glutting, & Ramineni, 2008), tend to demonstrate somewhat stronger predictive validity than the briefer, single proficiency measures. Table 2 provides a description of multiple number proficiency test studies and their reported predictive validity.

The NKT is an individually administered measure that is one of the earliest attempts to assess students’ procedural and conceptual knowledge related to whole numbers. The NKT includes a number of the critical proficiencies described previously, such as the ability to make magnitude comparisons, count, and use basic arithmetic operations in multiple formats including word problems that are read to the student. The NKT takes about 10 to 15 min to administer and consists of four levels of increasing difficulty. For example, children at the second level compare the numbers 5 and 4 and identify the bigger number. The same problem type is presented at the third level using the numbers 19 and 21.

Baker et al. (2002) explored the ability of the NKT administered at the end of kindergarten to predict mathematics achievement on the Stanford Achievement Test-Ninth Edition (SAT-9; Pearson, 1996) at the end of first grade. The predictive validity coefficient of the NKT was .73 for Total Mathematics. Note that this lengthier multiple-proficiency measure tends to demonstrate slightly higher predictive validity than many of the briefer measures discussed earlier.

Finally, item response theory (IRT) was used to establish the internal consistency and reliability of the NKT and to examine the extent to which the four levels of the NKT fit item difficulties. The IRT reliability was .93. Descriptive analyses revealed that most of the items fit the levels established by Okamoto and Case (1996), although a few were misplaced and a paucity of items were at the easy level of difficulty, indicating that the measure would not necessarily be sensitive to growth for students at the lower end of the distribution.

More recently, Jordan et al. (2008) developed a screening battery based on the same theoretical and empirical underpinnings as the Locuniak-Jordan research but much briefer and more efficient, with an administration time of approximately 15 min. Test-retest reliability ranged from .61 to .86 with a predictive validity of .63 from kindergarten administration to student mathematics achievement, measured by the Woodcock-Johnson III Tests of Achievement (WJ-III; Woodcock, McGrew, & Mather, 2001) in third grade.

Seethaler and Fuchs (2010) administered a single proficiency measure, a magnitude comparison (Chard et al., 2005), and a multiple proficiency measure (the Number Sense, created by the authors) in September and May of kindergarten. At the end of first grade, conceptual and procedural outcomes were measured on The Early Math Diagnostic Assessment (EMDA; The Psychological Corporation, 2002) and the KeyMath-Revised (KM-R; Connolly, 1998). Comparisons of single and multiple proficiency screeners, fall versus spring kindergarten screening, and conceptual versus procedural outcomes were conducted using logistic regression and ROC analyses. Results indicated that single and multiple proficiency screeners produced good and similar classification accuracy at the fall and spring screening occasions on the conceptual outcome. Interestingly, each of their screeners classified future conceptual math difficulties (MD) status with significantly greater accuracy than future procedural MD status. During the fall of kindergarten, the area under the curve (AUC) for the three screeners ranged from .80 to .86 for the EMDA Math Reasoning subtest, indicating good predictive utility for conceptual MD status, whereas AUC for the EMDA Numerical Operations subtest ranged from .67 to .69, indicating poor predictive utility for procedural MD status. One possibility is that strategic counting would theoretically seem to be a strong predictor of computational proficiency. Another possible explanation, proposed by the authors, is that numerical operations are not stressed in kindergarten and first grade, whereas foundational mathematical concepts are heavily stressed.
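
To make the AUC values above more concrete, the sketch below computes an area under the ROC curve from screener scores using its rank-based interpretation: the probability that a randomly chosen student with later difficulty scored lower on the screener than a randomly chosen student without later difficulty, with ties counted as one half. This is a generic illustration, not the analysis Seethaler and Fuchs ran; the function name and all scores are hypothetical.

```python
def auc(at_risk_scores, not_at_risk_scores):
    """Rank-based AUC for a screener on which LOWER scores signal risk."""
    if not at_risk_scores or not not_at_risk_scores:
        raise ValueError("both groups need at least one score")
    concordant = 0.0
    for low in at_risk_scores:
        for high in not_at_risk_scores:
            if low < high:
                concordant += 1.0   # pair ordered as the screener predicts
            elif low == high:
                concordant += 0.5   # tie counts as half
    return concordant / (len(at_risk_scores) * len(not_at_risk_scores))


# Hypothetical fall screener scores, grouped by end-of-first-grade status.
later_difficulty = [4, 6, 7, 11]
no_later_difficulty = [8, 10, 12, 14, 15]

print(auc(later_difficulty, no_later_difficulty))
```

An AUC of .50 corresponds to chance-level classification and 1.0 to perfect separation of the two groups, which is why values in the .80s are read as good and values in the .60s as poor.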

SCREENERS ALIGNED TO 
CURRICULUM STANDARDS 

A different approach to screening has been developed by Fuchs and colleagues (e.g., Fuchs, Fuchs, & Zumeta, 2008) and Clarke and colleagues (Clarke et al., 2012; Clarke et al., 2011). Typically, this screening approach consists of a group-administered paper and pencil test containing items that represent the current year’s curricular scope and sequence (derived, for example, from current state standards or National Council of Teachers of Mathematics [NCTM] Focal Points). A strength of these measures is that they can be quickly administered and scored by machine. They also possess strong face validity because of their focus on key curriculum topics for the current year. Typically, these measures demonstrate acceptable test-retest, inter-rater, and alternate form reliability above .80. The concurrent and predictive validities of these measures are between .50 and .60 (see Foegen et al. [2007] for an extensive review). There are some instances of their use in first grade (e.g., Seethaler & Fuchs, 2010; Clarke et al., 2011).


OTHER MEASURES TO 
CONSIDER INCLUDING IN 
A MORE COMPREHENSIVE 
SCREENING BATTERY 

One might think that only mathematics measures should be used in screening for potential mathematics difficulties. However, recent research on early identification of students with problems learning mathematics has discovered that working memory and student engagement may also be useful in predicting problems in mathematics. For that reason, we discuss both in this article.

Working Memory 

Recent syntheses of the literature on mathematics disabilities (e.g., Desoete, Ceulemans, Roeyers, & Huylebroeck, 2009; Geary, 2004) observe that, in addition to problems with magnitude comparison, counting strategies, and computational strategies, these students often display deficits in working memory (Geary, Hoard, Byrd-Craven, Nugent, & Numtee, 2007; Swanson & Beebe-Frankenberger, 2004) and problems with visual-spatial memory and elaboration (e.g., Geary, 2004). Contemporary research consistently demonstrates the importance of working memory (Baker et al., 2002; Locuniak & Jordan, 2008; Swanson & Beebe-Frankenberger, 2004) in understanding mathematical proficiency at many different age levels. Working memory is often measured by a reverse digit span task, that is, a task requiring a student to repeat a set of numbers read to him (e.g., 9, 4, 17, 8) in precisely the reverse order (i.e., 8, 17, 4, 9).
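
The scoring logic of a reverse digit span item can be illustrated in a few lines. The sketch below only checks whether a response reproduces the presented numbers in exactly reverse order and counts correct trials; it is not the scoring procedure of any published instrument, and the function names are hypothetical.

```python
from typing import Iterable, Sequence, Tuple


def is_correct_reverse_recall(presented: Sequence[int], response: Sequence[int]) -> bool:
    """True only if the response is the presented list in exactly reverse order."""
    return list(response) == list(reversed(presented))


def count_correct_trials(trials: Iterable[Tuple[Sequence[int], Sequence[int]]]) -> int:
    """Number of (presented, response) trials recalled correctly in reverse."""
    return sum(is_correct_reverse_recall(p, r) for p, r in trials)


presented = [9, 4, 17, 8]
print(is_correct_reverse_recall(presented, [8, 17, 4, 9]))  # prints True
print(is_correct_reverse_recall(presented, [8, 4, 17, 9]))  # prints False
```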

Working memory seems to function less efficiently as a screening measure than as a variable that adds precision to a set of other predictors (e.g., Baker et al., 2002; Locuniak & Jordan, 2008). The relationship between working memory and number sense appears to be complex. We believe that working memory relates to mathematics proficiency because in mathematics, students are asked not only to remember, but also to mentally “juggle” several bits of abstract information (e.g., basic facts, positions of numbers on a mental line, computational procedures, etc.). Students with weak number sense need to rely more on working memory. Students weak in both areas are likely to struggle, and future research should target optimal intervention strategies for this group of students.

Student Engagement and 
Attentiveness 

The relationship between teacher ratings of a student’s attentiveness and at-risk status in mathematics also appears to be consistent (Bodovski & Farkas, 2007; DiPerna, Lei, & Reid, 2007; Fuchs et al., 2007). Recent research on early identification of students with difficulties learning mathematics suggests that measures of a student’s attentiveness during academic instruction are solid predictors of future mathematics achievement. Utilizing the Early Childhood Longitudinal Study (ECLS-K) database, a number of researchers have begun to explore the relationship between student engagement and achievement growth in mathematics.

Bodovski and Farkas (2007) examined the 
growth rates from the fall of kindergarten to the 
spring of third grade with a nationally representa¬ 
tive sample of 13,043 students. ECLS-K mea¬ 
sured student engagement by asking teachers to 
complete a six-item survey rating each student’s 
attention, persistence with tasks, and demonstra¬ 
tion of learning independence. As expected, 
higher achieving students displayed greater levels 
of engagement. A secondary analysis of the lowest 
quarter of students found that within this group, 
students with the greatest need showed lower 
rates of achievement growth and engagement over 
time. Across all grade levels tested (i.e., K—5), the 
attention measure contributed a unique propor¬ 
tion of the variance in outcomes, beyond initial 
skill status. This effect was striking because the
impact of student engagement was greater than that of
time spent on instruction, and it was
largest for the lowest achieving students.
This finding suggests that interventions for
students with problems in mathematics might se¬ 
riously consider adding a component that pro¬ 
motes attentiveness to academic tasks and 
activities. 


DiPerna et al. (2007) also examined the rela¬ 
tionship between student engagement and 
achievement growth in mathematics using the 
ECLS-K database from kindergarten entry to 
third-grade exit with a nationally representative 
sample of 3,240 students. Teachers’ ratings of stu¬ 
dents’ academic attentiveness at the beginning of 
kindergarten played a role in predicting subse¬ 
quent mathematics achievement, with predictive 
validity correlation coefficients ranging from .28 
to .35. Fuchs et al. (2007) also found that teach¬ 
ers’ appraisal of attentiveness was a significant 
predictor of future mathematics achievement. 

Although these correlations are not nearly as 
large as those of the mathematics screening mea¬ 
sures, they suggest that ratings of attentiveness 
could be added to a screening battery to create 
some type of composite score. They also suggest 
that many students requiring some type of early 
intervention in mathematics might also struggle 
with maintaining attention to academic tasks for 
sustained amounts of time. 

As researchers build more sophisticated mod¬ 
els for early identification using more extensive 
batteries (e.g., Fuchs et al., 2007; Locuniak & 
Jordan, 2008), we may begin to generate more 
precise screening methods. The more sophisti¬ 
cated models may take into account measures of 
working memory, attentiveness, and perhaps 
other cognitive variables. 

PREDICTIVE VALIDITY AND 
DIAGNOSTIC CLASSIFICATION 
ACCURACY 

Predictive Validity 

The main method for evaluating the measures 
discussed in this article was to examine predictive 
validity using Pearson correlations over the course 
of a school year. (A small number of studies were 
conducted over a longer timeframe, e.g., 2 to 3 
years. We discuss these separately.) All of these 
studies involved multiple proficiency measures or 
studies that used diagnostic utility statistics. 
When designing screening measures in mathe¬ 
matics, a critical variable to consider is the extent 
to which performance on those measures relates 
to later performance in mathematics. For example,
a student’s score on a first-grade screening
measure would need to accurately predict diffi¬ 
culty in mathematics at the end of first grade, and 
ideally performance a year later as well. Assess¬ 
ments that show evidence of predictive validity 
can aid in instructional decision making. If evi¬ 
dence indicates that a score below a certain 
threshold on a kindergarten or beginning of first- 
grade measure of mathematics predicts later prob¬ 
lems, then schools and teachers can use that 
information to allocate resources for instructional 
or intervention services to those students. 
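
As an illustration only, a predictive validity coefficient of the kind
described above can be computed as a Pearson correlation between a fall
screener score and a later achievement score. The function name and the
six pairs of scores below are hypothetical and are not taken from any
study cited in this article.

```python
from math import sqrt


def pearson_r(screener, outcome):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(screener)
    mean_x = sum(screener) / n
    mean_y = sum(outcome) / n
    # Sum of cross-products and sums of squares around the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(screener, outcome))
    var_x = sum((x - mean_x) ** 2 for x in screener)
    var_y = sum((y - mean_y) ** 2 for y in outcome)
    return cov / sqrt(var_x * var_y)


# Hypothetical data: fall screener scores and spring achievement scores
# for the same six students.
fall_screener = [4, 7, 9, 12, 15, 18]
spring_outcome = [20, 31, 28, 40, 45, 52]

r = pearson_r(fall_screener, spring_outcome)
print(round(r, 2))
```

A coefficient of this kind summarizes agreement across the whole
distribution of scores. It does not by itself show how well the screener
separates students above and below a particular cut score.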

This approach provides reasonable estimates 
of how well the distribution of scores on the 
screener corresponds with the distribution of stu¬ 
dents’ performance on a lengthier achievement 
test in mathematics administered later in the 
school year. However, for screening measures, we
are interested primarily in one group of students:
the 15% to perhaps 30% who are at risk.
Ultimately, a screening measure rises or falls based 
on how well it is able to pinpoint which students 
need additional help. For that reason, recent 
screening research has begun to use measures of 
classification accuracy (e.g., Clarke et al., 2011;
Jordan, Glutting, Ramineni, & Watkins, 2010; 
Mazzocco & Thompson, 2005; Seethaler & 
Fuchs, 2010). Because this information is rela¬ 
tively new to special education research (this is 
less true for psychology research), and because the 
concepts—classification accuracy, specificity, sen¬ 
sitivity, and ROC curves—are relatively new and 
often not well understood, we explain them 
below. 

Classification Accuracy 

Classification accuracy refers to the degree to 
which the screener provides correct classifications 
of children who require additional assistance in 
mathematics. There are two types of mistakes a 
screener or screening battery can make. The first 
is to miss students who truly need help, that is, to 
create false negatives. To assess this risk, we report 
on the screener’s sensitivity. The second type of 
classification mistake is to falsely identify students 
as needing help when in fact they do not require 
additional instruction or assistance. This group of 
students is called false positives. To assess this risk, 
we report on the screener’s specificity. 
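
The two kinds of mistakes described above can be made concrete with a
small sketch. The class name, function names, and counts below are
hypothetical and serve only to show how sensitivity, specificity, and
overall classification accuracy follow from the four possible outcomes
of a screening decision.

```python
from dataclasses import dataclass


@dataclass
class ScreeningCounts:
    true_pos: int   # flagged by the screener and truly needed help
    false_neg: int  # not flagged, but truly needed help (missed)
    false_pos: int  # flagged, but did not need help
    true_neg: int   # not flagged and did not need help


def sensitivity(c):
    """Proportion of students who needed help that the screener flagged."""
    return c.true_pos / (c.true_pos + c.false_neg)


def specificity(c):
    """Proportion of students who did not need help that were not flagged."""
    return c.true_neg / (c.true_neg + c.false_pos)


def classification_accuracy(c):
    """Proportion of all students who were classified correctly."""
    total = c.true_pos + c.false_neg + c.false_pos + c.true_neg
    return (c.true_pos + c.true_neg) / total


# Hypothetical counts for 100 screened students, 20 of whom truly needed help.
counts = ScreeningCounts(true_pos=18, false_neg=2, false_pos=24, true_neg=56)
print(sensitivity(counts), specificity(counts), classification_accuracy(counts))
```

With these made-up counts, sensitivity is .90 and specificity is .70,
yet more than half of the 42 flagged students (24) are false positives.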


Earlier, the field focused heavily on high sen¬ 
sitivity, and that remains a major concern in 
much of the published research (e.g., Seethaler & 
Fuchs, 2010). However, the response to interven¬ 
tion (RTI) research community has become in¬ 
creasingly aware of the phenomenal waste of 
resources that comes with false positives, and 
there has been more focus on specificity (see, for 
example, Silberglitt & Hintze, 2005). 

Sensitivity: Ensuring That No One 
“Falls Through the Cracks” 

Jordan provides a practical definition for the term 
sensitivity: “Sensitivity is the proportion of indi¬ 
viduals with a disorder (e.g., individuals with low 
achievement or learning disabilities) who are correctly
identified by a positive test finding” (Jordan,
Glutting, Ramineni, & Watkins, 2010, p. 184).
Researchers have used various working definitions 
of how to determine what it means to “require ad¬ 
ditional assistance” in an academic area, realizing 
that any of these operational definitions are, at 
best, educated guesses. 

There is no common standard for determin¬ 
ing what the term “at risk” or “would benefit 
from intervention” means. The problem is hardly 
unique to education. Public health officials grapple
with this issue as they determine categories such 
as “at risk for heart disease” or “at risk for a 
stroke,” and change criteria to reflect evolving 
definitions of at risk. 

Some researchers have focused less on RTI 
and more on valid early identification of students 
who possess a disability in mathematics. These researchers
often use a criterion of the 10th percentile
(e.g., Fuchs et al., 2007; Morgan et al.,
2009), reasoning that about one student in ten 
has a learning disability in mathematics or is at 
strong risk for developing a disability in mathe¬ 
matics in the future. Others focus on using a 
screening measure or battery to determine which 
students might benefit from an intervention. 
These very different criteria for classifying a stu¬ 
dent as “at risk” have a profound impact on any 
discussion of sensitivity, and the professional liter¬ 
ature often does not make this distinction clear. 

During the initial development of screening 
measures in mathematics and reading, researchers 
often argued that it was most important to “catch 
kids early,” and they feared letting any at-risk students
through the screening battery. Thus, they
often set a high criterion, such as the 40th per¬ 
centile or below, to avoid missing any students 
who might have a true problem. However, there 
are costs to this approach. The formal means for 
assessing those costs is specificity. 

Specificity and the Concern 
for Wasted Resources 

Often neglected but equally important is knowl¬ 
edge of whether students recommended for Tier 2 
intervention would succeed without being pro¬ 
vided with any particular intervention. We refer 
to this as the specificity of the screener. In the zeal 
to make sure that we do all we can to provide 
early intervention to students in reading and 
mathematics, this topic has been neglected until 
recently (Gersten et al., 2009; Silberglitt &
Hintze, 2005). At this time, there is no consensus 
or commonly used convention for reporting what 
it means to “succeed without any additional inter¬ 
vention.” Deciding what performance criterion 
will be used is a key issue that is not easy to re¬ 
solve. 

Weak specificity indicates that a screening 
measure or battery is over-identifying students 
and thus providing services to students who do 
not need them. Students who are misclassified as 
needing help even though they don’t are called 
false positives. False positives are problematic because
resources may be wasted (Gersten et al.,
2009) by providing extra intensive intervention to 
students who do not need such help, as often 
happens in the field of reading (e.g., Jenkins & 
O’Connor, 2002). For schools with finite re¬ 
sources, this is particularly vexing. Resources 
spent on providing interventions to students who 
do not need them are thus not available to be 
spent on other valuable services. False positives 
are not only taxing on schools but can also be 
detrimental to parents and students “. . . given
the generally ‘chaotic’ nature of early achievement
and the increased possibility of falsely identifying
students as being ‘at-risk’ when they are merely
distracted, anxious, or unfamiliar with the testing
protocols” (Bryant et al., 2011, p. 9). Students
and parents may suffer adverse effects of thinking
the child has a disability when in fact the student
has no disability whatsoever.

Can We Balance These Two 
Concerns? ROC Curves as 
a Potential Tool 

A measure with perfect sensitivity ensures that all 
students who require intervention receive extra 
support. A measure with perfect specificity en¬ 
sures that schools do not spend resources on stu¬ 
dents who do not need extra support. However, 
measurement in education, medicine, psychology, 
and most human endeavors is far from perfect 
and consists of a series of compromises and bal¬ 
ances. For screening, we need to balance two 
goals: accurately detecting which students require
early intervention (sensitivity) and detecting
only those students who require additional help 
(specificity). 

Here we face a bit of a paradox. The more we 
increase sensitivity, the more we try to ensure that 
we do not miss any students who might need in¬ 
tervention, but in doing so the more we decrease 
specificity. Thus, development and refinement of 
effective screening measures requires a delicate 
balance. The selection of a cut score (the number 
at which a score at or above classifies the student 
as not at risk and a score below classifies the stu¬ 
dent as at risk) affects both sensitivity and speci¬ 
ficity in a reciprocal fashion (e.g., setting the cut 
score to have higher sensitivity leads to lower 
specificity). 

In the past, most researchers simply skirted 
the issue and reported Pearson correlations of pre¬ 
dictive validity. As a field, we are only beginning 
to develop conventions for reporting on sensitiv¬ 
ity and specificity. One widely used tool in evalu¬ 
ating the utility of a diagnostic instrument is the 
ROC curve (see Table 3 for a summary of studies 
utilizing ROC analyses). Although relatively new 
to the area of mathematical screening, the classifi¬ 
cation accuracy of diagnostic tests has long been 
of interest in medicine. “By systematically using 
all possible cut scores of a test and plotting the 
true-positive rate (i.e., sensitivity) against the 
false-positive rate (i.e., 1-specificity) for each cut 
score, diagnostic validity can be displayed for the 
full range of the test’s scores” (Jordan, Glutting, 
Ramineni, & Watkins, 2010, p. 184). In essence, 
the ROC curve plots sensitivity vs. 1-specificity
and illustrates the inverse relationship between 
sensitivity and specificity; changing the cut score 
to improve one will lower the other. 
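
A minimal sketch of this procedure follows. It assumes that a student is
flagged as at risk when the screener score falls below the cut score, as
in the definition of a cut score given above. The function names and the
ten scores are hypothetical, and the area under the curve (AUC) is
approximated with the trapezoidal rule.

```python
def roc_points(scores, at_risk):
    """Return (cut, sensitivity, false-positive rate) for each candidate cut.

    A student is flagged when the screener score is below the cut score.
    """
    n_pos = sum(at_risk)
    n_neg = len(at_risk) - n_pos
    # Every observed score is a candidate cut, plus one cut above the
    # maximum so that the curve reaches the point where everyone is flagged.
    cuts = sorted(set(scores)) + [max(scores) + 1]
    points = []
    for cut in cuts:
        tp = sum(1 for s, r in zip(scores, at_risk) if r and s < cut)
        fp = sum(1 for s, r in zip(scores, at_risk) if not r and s < cut)
        points.append((cut, tp / n_pos, fp / n_neg))
    return points


def area_under_curve(points):
    """Trapezoidal area under sensitivity plotted against 1 - specificity."""
    xy = sorted((fpr, tpr) for _, tpr, fpr in points)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(xy, xy[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical screener scores and later at-risk status for ten students.
scores = [3, 5, 6, 8, 9, 11, 12, 14, 15, 17]
at_risk = [True, True, True, False, True, False, False, False, False, False]

points = roc_points(scores, at_risk)
for cut, sens, fpr in points:
    print(cut, round(sens, 2), round(1 - fpr, 2))  # cut, sensitivity, specificity
print(round(area_under_curve(points), 3))
```

Reading down the printed rows, sensitivity never falls and specificity
never rises as the cut score increases, which is the reciprocal
relationship described above. For these made-up data the AUC is about .96.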

Example of an ROC Analysis 

If a researcher chose a score on a screener that 
would correctly identify every child who scored 
below a given criterion on subsequent mathemat¬ 
ics achievement tests (e.g., the 10th percentile for 
MLD or the 25th for at-risk status), they would 
invariably include many false positives, thus re¬ 
sulting in unacceptably low specificity. ROC anal¬ 
yses can help inform these decisions. 

ROC analyses select the cut score mathemat¬ 
ically to maximize sensitivity and specificity in a 
balanced way. In an ROC analysis, the outcome is 
specified a priori. For example, Fuchs et al. 
(2007) specified a score below the 10th percentile 
on various mathematics achievement measures as 
representing MLD. Then, using the AUC in the 
ROC analysis, which provides a metric for accu¬ 
racy of group discrimination based on the a priori 
cut score, they were able to identify how accu¬ 
rately the number sense measure discriminated 
between the MLD and non-MLD groups based 
on their performance. By examining the ROC 
curve, one can actually examine the impact of 
various cut scores on accurate group identification 
(e.g., MLD or at risk for MLD). However, al¬ 
though an ROC analysis increases sensitivity, it 
does not concomitantly increase specificity. The 
role of the cut score is an integral one. Recent re¬ 
search demonstrates how ROC analyses can assist 
researchers in development of accurate measures, 
but there are trade-offs that necessitate acknowl¬ 
edgement. 

Contemporary Research Using 
ROC Analyses: The Search for 
an Ideal Balance 

Earlier studies that pioneered the use of ROC 
analyses typically reported the AUC and noted 
whether it was higher than .80. If so, they re¬ 
ported that the AUC was “good” following a 
standard of clinical significance, with .80 to .89 
being “good” and .90 to 1.00 being “excellent” 
(Cicchetti, 2001). In contrast, Seethaler and
Fuchs (2010) perform a much more sophisticated,
useful analysis. In addition to reporting
sensitivity, specificity, and the AUC, the authors 
provide the number of students incorrectly iden¬ 
tified by the screener as MD (false positives), the 
number of students incorrectly identified as non- 
MD (false negatives), the number of students 
correctly identified as MD (true positives), and 
the number of students correctly identified as 
non-MD (true negatives). They did this analysis 
separately for students classified as MD-concep- 
tual and MD-procedural. Students received an 
MD-conceptual designation if they scored below 
the 16th percentile on the EMDA Math Reason¬ 
ing subtest and MD-procedural if they scored 
below the 16th percentile on the EMDA Numer¬ 
ical Operations subtest. As can be seen in Table 
3, Seethaler and Fuchs report sensitivity of 
89.8% for the quantity discrimination (Clarke & 
Shinn, 2004) screener in predicting MD-proce- 
dural status. However, the number of students 
incorrectly identified as MD was also high (93). 
In other words, despite high values of sensitivity, 
nearly half of the students testing positive were 
actually false positives, resulting in a specificity of 
32.1%. Thus, simply relying on sensitivity can 
produce misleading conclusions about the diag¬ 
nostic accuracy of a test. 
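
As a rough check on how these figures fit together, the sketch below
works backward from the two values reported above (93 false positives
and a specificity of 32.1%). The implied counts are derived by
arithmetic and are not reported in the text; rounding in the published
values could shift them slightly.

```python
# Figures reported in the text for the quantity discrimination screener
# (MD-procedural outcome).
false_positives = 93
specificity = 0.321

# specificity = TN / (TN + FP), so TN = specificity * FP / (1 - specificity).
implied_true_negatives = specificity * false_positives / (1 - specificity)
implied_non_md_total = implied_true_negatives + false_positives

print(round(implied_true_negatives), round(implied_non_md_total))
```

Under these assumptions, roughly 44 students without MD were correctly
passed by the screener, out of roughly 137 students without MD in total.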

Clarke et al. (2011) performed a similar anal¬ 
ysis on the first-grade version of a newly designed 
measure called easyCBM (http://www.easycbm 
.com/). In this case, they used a criterion devel¬ 
oped by Silberglitt and Hintze (2005). These 
researchers specify a particular procedure for using an
ROC curve. This entails (a) including only cut
scores for which both specificity and sensitivity are
equal to or greater than .70, and (b) performing an
intricate titration process so that sensitivity is in¬ 
creased as much as possible while specificity re¬ 
mains at least .70. This process is described in 
detail in Clarke et al. (2011) and demonstrates a 
promising method for balancing the need to cor¬ 
rectly identify as many students who need help as 
possible while not casting such a wide net that 
students who would do fine without help are 
given costly assistance. 
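
The two criteria above can be sketched as a simple selection rule. This
is a simplified reading of the criteria as summarized here, not the full
titration process described in Clarke et al. (2011). The function name,
the tie-breaking rule, and the candidate values are hypothetical.

```python
def select_cut_score(candidates, floor=0.70):
    """Pick a cut score from (cut, sensitivity, specificity) tuples.

    Step (a): keep only cuts where both indices are at least `floor`.
    Step (b): among those, take the cut with the highest sensitivity.
    Ties are broken in favor of higher specificity (an assumption).
    Returns None when no cut satisfies step (a).
    """
    eligible = [c for c in candidates if c[1] >= floor and c[2] >= floor]
    if not eligible:
        return None
    return max(eligible, key=lambda c: (c[1], c[2]))


# Hypothetical candidate cut scores with their sensitivity and specificity.
candidates = [
    (8, 0.60, 0.92),
    (10, 0.74, 0.85),
    (12, 0.83, 0.76),
    (14, 0.91, 0.64),
    (16, 0.97, 0.41),
]

print(select_cut_score(candidates))
```

With these made-up values, the cut score of 12 is selected. The cut of
14 has higher sensitivity but is excluded because its specificity falls
below .70.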

DISCUSSION

The recent report on early childhood mathemat¬ 
ics learning (National Research Council, 2009) 
concluded: 

Further exploration is needed to better un¬ 
derstand what early number competencies 
are predictive of future success in mathemat¬ 
ics. Such research can help identify children 
at risk for learning difficulties or disabilities 
in mathematics . . . [and] develop targeted 
interventions for such children and test their 
effectiveness. (p. 350)

The authors were primarily addressing 
preschool, but the need is as critical for students 
in the primary grades. The longitudinal studies 
documenting the persistence of mathematics dis¬ 
abilities and difficulties in learning mathematics 
from kindergarten to the upper elementary grades 
create a compelling case for future research on de¬ 
velopment and refinement of valid universal 
screening measures for students in the primary 
grades. 

In the remainder of this section, we highlight 
several areas for future research. We also note 
pragmatic issues faced by school personnel at¬ 
tempting to implement some of the screening 
measures discussed earlier. 

Grappling With the Concept 
of Risk Status 

Determining risk status is as much an art as a sci¬ 
ence. This is true in all fields. However, it remains 
a vexing issue in the area of mathematics for sev¬ 
eral reasons. A source of confusion in the field is 
that criteria for determining at-risk status in 
mathematics have varied from below the 25th 
percentile on a normed mathematics measure 
(Locuniak & Jordan, 2008) to below the 10th 
percentile (Fuchs et al., 2007; Morgan et al.,
2009). In the first case, the researchers mirrored 
what a school district might do: cast a relatively 
broad net to ensure that all students who may 
need intervention receive it. However, as districts 
have learned from experiences with RTI in read¬ 
ing, and as researchers have begun to consistently 
note, there are real drawbacks to casting too 
broad a net. For one thing, a good deal of time 
and money is wasted because intervention is pro¬ 
vided to students who would do fine without it. 


In addition, resources that are usually sorely 
needed are pulled away from intermediate grades 
and middle school. 

Potential Limitations of ROC 
Analyses 

Because there remains no clear definition of what 
constitutes “at risk,” decisions made regarding at- 
risk classification are often complex. Normally, re¬ 
searchers select one specific cut score and students 
who do not achieve that predetermined score are 
determined likely to be at risk in the area targeted 
by the assessment. However, researchers have a lot 
of discretion in selecting the precise score to use. 
The decisions have a significant impact on the 
classification accuracy of any screening instru¬ 
ment selected and the number of students identi¬ 
fied for additional support in mathematics. The 
recent research focusing on classification accuracy 
using ROC analyses to evaluate the sensitivity and 
specificity of a screening system is certainly a step 
forward over simply presenting a predictive valid¬ 
ity correlation coefficient. However, ROC analy¬ 
ses are based on various, usually unstated, 
mathematical assumptions (i.e., normality of 
noise [or error] distributions). A recent article by 
VanDerHeyden (2011) provides one of the most 
probing discussions of the limitations of sensitiv¬ 
ity and specificity for both practitioners and re¬ 
searchers using these analyses. The author 
proposes a more probabilistic, Bayesian approach, 
inspired by the work of Robyn Dawes (1962). 
VanDerHeyden highlights the importance of con¬ 
sidering positive predictive power, that is, the 
probability that a score below benchmark is an ac¬ 
curate indicator of risk. She also provides impor¬ 
tant cautions: 

Sensitivity and specificity offer little informa¬ 
tion about the value of a test finding for rul¬ 
ing-out or ruling-in a condition. Generally, 
predictive power quantifies the value of a test 
finding for ruling in (positive predictive 
power) and ruling-out (negative predictive 
power) a condition in a way that is easily in¬ 
terpreted and used. (VanDerHeyden, 2011, 
p. 342) 

In other words, predictive power can tell a 
teacher or psychologist whether a particular stu¬ 
dent is really at risk for learning problems in 
math. However, positive and negative predictive
power estimates are problematic because the field 
has no precise definition of what it means to be 
“at risk” for a learning problem in math or to pos¬ 
sess a math learning disability. 
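
The dependence of predictive power on the definition of risk can be
illustrated with Bayes' rule. In the sketch below, the base rate is the
proportion of students who are truly at risk under whatever definition
is adopted. The function names and all numerical values are hypothetical.

```python
def positive_predictive_power(sens, spec, base_rate):
    """Probability that a student flagged by the screener is truly at risk."""
    true_pos = sens * base_rate
    false_pos = (1 - spec) * (1 - base_rate)
    return true_pos / (true_pos + false_pos)


def negative_predictive_power(sens, spec, base_rate):
    """Probability that a student not flagged is truly not at risk."""
    true_neg = spec * (1 - base_rate)
    false_neg = (1 - sens) * base_rate
    return true_neg / (true_neg + false_neg)


# Same hypothetical screener (sensitivity .90, specificity .70) evaluated
# under two definitions of risk: 10% versus 25% of students at risk.
for base_rate in (0.10, 0.25):
    ppp = positive_predictive_power(0.90, 0.70, base_rate)
    npp = negative_predictive_power(0.90, 0.70, base_rate)
    print(base_rate, round(ppp, 2), round(npp, 2))
```

With sensitivity and specificity held fixed, positive predictive power
is .25 when 10% of students are at risk and .50 when 25% are at risk, so
the same screener looks quite different depending on how "at risk" is
defined.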

Advances in the Use of ROC 
Analyses 

Ultimately, we envision that ROC can help us se¬ 
lect a benchmark, but we then need to ensure that 
our cut score demonstrates adequate validity in 
terms of consequential validity (Gersten, Keating, 
& Irvin, 1995; Messick, 1980), that is, use of a 
given screening procedure must be linked to in¬ 
creases in the mathematics performance of stu¬ 
dents at the lower end of the distribution in 
math. We need to study the impacts of imple¬ 
menting specific screening measures or batteries 
and specific benchmarks on practice, using occa¬ 
sional case study and descriptive research. 

The concern for false positives (low speci¬ 
ficity), especially in the early grades, has received 
increased attention in both reading and mathe¬ 
matics. Recently, Siegel, Fuchs, O’Connor and 
Vaughn (2011) experimented with a method for 
reducing this rate in the primary grades. Although 
the impact was not statistically significant, this 
does seem to be a promising method for future 
study. Students who score below the cut score in 
the fall are not immediately placed in a Tier 2 in¬ 
tervention. Rather, they continue to receive only 
typical classroom instruction, but their progress is 
monitored closely (e.g., weekly) for a period of 6 
to 8 weeks. Only those with unacceptably low 
rates of progress receive Tier 2 interventions. In 
our view, this type of approach warrants further 
research and is worthy of serious consideration. 
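
The logic of this two-stage approach can be sketched as follows. The
growth criterion is an assumption made for illustration: the text does
not specify how an "unacceptably low" rate of progress is defined, so
the least-squares slope, the function names, the `min_growth` parameter,
and all scores below are hypothetical.

```python
def weekly_growth(scores):
    """Least-squares slope of scores against week index (0, 1, 2, ...)."""
    n = len(scores)
    mean_w = (n - 1) / 2
    mean_s = sum(scores) / n
    num = sum((w - mean_w) * (s - mean_s) for w, s in enumerate(scores))
    den = sum((w - mean_w) ** 2 for w in range(n))
    return num / den


def needs_tier2(fall_score, weekly_scores, cut_score, min_growth):
    """Two-stage decision: fall screen first, then monitored growth."""
    # Stage 1: students at or above the cut score are not monitored further.
    if fall_score >= cut_score:
        return False
    # Stage 2: among flagged students, only slow growers go to Tier 2.
    return weekly_growth(weekly_scores) < min_growth


# Two hypothetical students who both scored below a cut score of 12 in
# the fall and were then monitored weekly for 6 weeks.
slow = needs_tier2(8, [8, 8, 9, 9, 9, 10], cut_score=12, min_growth=1.0)
fast = needs_tier2(8, [8, 10, 12, 13, 15, 17], cut_score=12, min_growth=1.0)
print(slow, fast)
```

In this sketch the second student, who would have been a false positive
under the fall screen alone, is not placed in Tier 2 because progress
under typical classroom instruction is adequate.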

We are only beginning to understand how to 
use the concepts of sensitivity, specificity, and 
classification accuracy in our research and how 
these analyses can provide critical information for 
districts or schools making decisions about what 
type of screening measures to use. For example, a 
district may decide to select a more efficient 
screening measure over a more comprehensive 
(less efficient) measure if the shorter measure 
demonstrates similar rates of accurate classifica¬ 
tion. Earlier studies merely reported that AUC 
was over .80 and concluded it was good based on
clinical significance criteria and stopped there.
The contemporary research by Seethaler and 
Fuchs (2010) and Clarke et al. (2011) seems to be 
a timely advance in the increasingly sophisticated 
use of ROC to balance the competing demands of 
specificity and sensitivity—of wasting resources 
versus letting students fall through the cracks. 
Despite its limitations, ROC remains a useful tool 
for establishing criteria for cut scores and can now 
be used by individual school districts (or individ¬ 
ual schools) if appropriate technical support is 
provided. 

Our Perspective on the Future 

With the advent of the Common Core State 
Standards (http://www.corestandards.org/the- 
standards/mathematics) and increased use of tech¬ 
nology, notions of efficiency change. Whereas 
even 5 to 10 years ago, the main consideration for 
efficiency was how long a test took to administer 
and score, with the use of technology, scoring can 
be done almost automatically. Testing, at least 
beginning sometime in first grade, would not 
require high degrees of adult supervision. Most 
importantly, use of IRT allows technology to cali¬ 
brate screening measures more precisely. 

At this point in time, we understand a good 
deal more about what comprises a comprehensive 
assessment battery, but are less certain of the ele¬ 
ments of an efficient assessment battery. A crucial 
criterion for use of a screening measure is effi¬ 
ciency. Jordan et al. (2008) have developed an 
efficient 33-item untimed screening measure that 
has good predictive validity, and the NKT is a 
relatively efficient screening tool, typically taking 
10 to 15 min to administer. Both of these are far 
more comprehensive than measures of one 
component of number sense, such as magnitude 
comparison. However, the realities of universal
screening require use of the most efficient 
measures. 

It is critical to note that the development of 
mathematical thinking is broader than the set of 
skills assessed by any one of the measures de¬ 
scribed previously, which focus almost exclusively 
on the domain of numbers. Some challenges the 
field will face will be to explore the role of student 
performance in other critical areas (e.g., geome¬ 
try), to determine how to measure performance in 
those domains, and to determine their relation¬ 
ship to proficiency in algebra and other advanced 
mathematical topics (G. J. Duncan, personal 
communication, May 7, 2011). As our under¬ 
standing of mathematical development advances, 
so too should our design of screening instruments 
that reflect the complexity of mathematics. How¬ 
ever, the insights about the importance of the so¬ 
phistication of an individual’s mental number line 
as a sensitive snapshot of mathematical develop¬ 
ment made by Okamoto and Case (1996) remain 
robust, as the evidence described in this article 
demonstrates. 

Each year in school brings about new chal¬ 
lenges for students and new material to master in 
order to further their mathematical understand¬ 
ing and build a foundation for future content. 
Because the demands of the mathematics curricu¬ 
lum continue to change over the years, it is possi¬ 
ble that certain students may initially learn math 
at acceptable levels only to experience problems 
once content moves to a more abstract level (e.g., 
with the introduction of decimals, improper frac¬ 
tions, ratios and proportions, negative numbers). 
Therefore, as in the reading field (Scarborough, 
2001), we will likely see students whose perfor¬ 
mance in mathematics is acceptable in the pri¬ 
mary grades, but deteriorates in later grades. 

Future research needs to address several criti¬ 
cal areas. The first is valid screening measures for 
Grades 3 and above, using IRT and important 
policy frameworks such as the Common Core 
Standards as a basis. Another potential research 
area is measurement of skills related to geometry 
and an examination of whether there are precur¬ 
sors of geometry proficiency that are different 
than those of proficiency with number concepts 
and operations. 


Last, we would be remiss if we did not em¬ 
phasize that the collection of screening data in 
and of itself does not change student outcomes. 
Any advances that schools make in screening stu¬ 
dents in mathematics must occur alongside efforts 
to improve instructional practices and to develop 
effective interventions. The body of research on 
this topic is sparse, but expanding rapidly. 
